-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement an initial set of uniform DCGM GPU metrics in dcgmreceiver
.
#219
Conversation
receiver/dcgmreceiver/scraper.go
Outdated
s.mb.RecordGpuDcgmPcieIoDataPoint(now, pcieRx, metadata.AttributeNetworkIoDirectionReceive) | ||
} | ||
if metric, ok := metrics["DCGM_FI_PROF_NVLINK_TX_BYTES"]; ok { | ||
nvlinkTx := int64(float64(metric.asInt64()) * (s.config.CollectionInterval.Seconds())) /* rate to delta */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only works if scrape()
is called exactly once per collection interval.
It sounds like the API already gives you timestamps for each point, so you can walk the list of points and integrate each point with the corresponding timestamp. And you probably need to do that from a separate collection goroutine if it needs to happen more often than the collection interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, scrape()
is called exactly once. I agree that the accuracy of the current approach is somewhat suboptimal, but I did fix a few bugs, and wasn't able to get the separate collection goroutine working. We can revisit this on the igorpeshansky-dcgm-new-metrics-parallel branch, but for now I'll stay with one scrape per collection.
|
||
// RecordGpuDcgmNvlinkIoDataPoint adds a data point to gpu.dcgm.nvlink.io metric. | ||
func (mb *MetricsBuilder) RecordGpuDcgmNvlinkIoDataPoint(ts pcommon.Timestamp, val int64, networkIoDirectionAttributeValue AttributeNetworkIoDirection) { | ||
mb.metricGpuDcgmNvlinkIo.recordDataPoint(mb.startTime, ts, val, networkIoDirectionAttributeValue.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is using the wrong start time if these are deltas. What happens when these incorrect delta points get written to Cloud Monitoring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be a bug in mdatagen
. I've changed all of these metrics to cumulative, for which this start time is more reasonable.
receiver/dcgmreceiver/util.go
Outdated
@@ -39,44 +48,44 @@ func (m *dcgmMetric) asInt64() int64 { | |||
return *(*int64)(unsafe.Pointer(&m.value[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the name implies, this is unsafe. Accessing a pointer in this way is undefined behavior in Go.
You want int64(binary.NativeEndian.Uint64(m.value[:8]))
, or for processing a whole slice at a time, binary.Read(...)
. Ditto for the three functions above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found an even more robust solution. There is no more usage of unsafe
in this receiver. Can't speak for go-dcgm
, though…
b1525f7
to
3e8243d
Compare
94f6048
to
b00e858
Compare
81f4014
to
4706b7a
Compare
Rearrange aggregates in scraper.
c91caf6
to
b779893
Compare
A follow-on to adb4c2e.
Metrics are now scraped independently of collector polling cycles, allowing higher resolution and better integrations. Co-authored-by: Igor Peshansky <[email protected]>
46b03ce
to
c148e0c
Compare
c148e0c
to
5de8b1c
Compare
This PR implements an initial set of DCGM GPU metrics from the uniform set defined in go/gce-gke-gpu-dcgm-metrics in
dcgmreceiver
.In the process, it changes the GPU device-identifying metric attributes into resource attributes, adds support for older (and newer) GPU devices, and removes the assumption of 1:1 metric/DCGM field correspondence.